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Abstract 

A Bayesian approach to the evaluation of person fit in item response theory (IKT) 
models is presented. In a posterior predictive check, the observed value on a discrepancy 
variable is positioned in its posterior distribution. In a Bayesian framework, a Markov 
chain Monte Carlo procedure can be used to generate samples of the posterior distribution 
of the parameters of interest. These draws can also be used to compute the posterior 
predictive distribution of the discrepancy variable. The procedure is worked out in detail 
for the 3-parameter normal ogive model, but it is also shown that the procedure can be 
directly generalized to many other IKT models. Type I error rate and the power against 
some specific model violations are evaluated using a number of simulation studies. Index 
terms: Bayesian statistics, item response theory, person fit, model fit, 3 -parameter normal 
ogive model, posterior predictive check, power studies, type l error 
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Applications of item response theory (IRT) models to the analysis of test items, 
tests, and item score patterns are only valid if the IKT model holds. Fit of items can 
be investigated across persons and fit of persons can be investigated across items. Item 
fit is important because in psychological and educational measurement, instruments 
are developed that are used in a population of persons; item-fit then can help the test 
constructor to develop an instrument that fits an IKT model in that particular population. 
Item-fit statistics have been proposed by, for example, Mokken (1971), Andersen (1973), 
\fen (1981, 1984), Molenaar (1983), Glas (1988, 1999), and Orlando and Thissen (2000). 
As a next step, the fit of an individual’s item score pattern can be investigated. Although a 
test may fit an IRT model, persons may produce patterns that are unlikely given the model, 
for example, because they have preknowledge of the correct answers to some of the most 
difficult items. Investigation of person fit may help the researcher to obtain additional 
information about the answering behavior of a person. By means of a person-fit statistic, 
the fit of a score pattern can be determined given that the IKT model holds. Some statistics 
can be used to obtain information at a subtest level and a more diagnostic approach can 
be followed. Meijer and Sijtsma (1995; in press) give an overview of person-fit statistics 
proposed for various ERT models. 

To decide whether an item score pattern fits an IRT model, a sampling distribution 
under the null model, that is, the IRT model, is needed. Let t be the observed value of a 
person-fit statistic T. Then the significance probability or probability of exceedance is 
defined as the probability under the sampling distribution that the value of the test statistic 
is equal or smaller than the observed value, that is, p = P(T ^ f), or equal or larger than 
the observed value, that is, p = P(T > t), depending on whether low or high values of 
the statistic indicate aberrant item score patterns. As will be discussed below, for some 
statistics theoretical asymptotic or exact distributions are known which can be used to 
classify an item score pattern as fitting or nonfitting. An alternative is to simulate data 
according to an IRT model based on the estimated item parameters and then determine p 
empirically (e.g., Reise, 1995, 1999; Reise & Widaman, 1999; Meijer & Nering, 1997). 
However, the true values of both the item and person parameters are unknown, and the 
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uncertainty about these parameters is often not taking into account when simulating an 
empirical distribution. 

In this article, we will explore an alternative approach based on a Bayesian 
framework and posterior predictive checks using Markov chain Monte Carlo (MCMC) 
methods. For more information on posterior predictive checks, refer to Meng (1994), 
Gelman, Carlin, Stem, and Rubin (1995), and Gelman, Meng, and Stem (1996). In 
principle, this approach applies to any ERT model, but in this study we will focus on 
the 3-parameter normal ogive (3PNO) model. 

Compared to the traditional frequentist approach, this Bayesian approach has several 
advantages. First, there is no need to derive the theoretical sampling distribution of the 
statistic, which sometimes may be very difficult, if not impossible. Second, the person- 
fit statistic may depend on unknown quantities as the item and person parameters which 
uncertainty is explicidy taken into account. The third advantage pertains to generality of 
the procedure. Simulation studies have show that a fully Bayesian approach to estimation 
of the parameters in simple IRT models (say 1- or 2-parameter models) are generally 
not superior to estimates obtained by a maximum maiginal likelihood (MML) procedure 
or a Bayes modal procedure (see, for instance. Baker, 1998, or Kim, 2001). However, 
the Bayesian approach also applies to complicated IRT models, where MML or Bayes 
modal approaches pose important problems. Recently, the fully Bayesian approach has 
been adopted to the estimation of IRT models with multiple raters, multiple item types, 
missing data (Patz & Junker, 1997, 1999), testlet structures (Bradlow, Wainer & Wang, 
1999, Wainer, Bradlow & Du, 2000), latent classes (Hoijtink & Molenaar, 1997), models 
with a multi-level structure on the ability parameters (Fox & Glas, 2001) and the item 
parameters (Janssen, Tuerlinckx, Meulders & de Boeck, 2000), and multidimensional 
IRT models (Beguin & Glas, 2001). The motivation for the recent interest in Bayesian 
inference and MCMC estimation procedures is that the complex dependency structures in 
the mentioned models require the evaluation of multiple integrals to solve the estimation 
equations in an MML or Bayes modal framework (Patz & Junker, 1999). These problems 
are easily avoided in an MCMC framework. Procedures for the evaluation of model fit, 
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such as the procedures for the evaluation of person fit presented here, can be direcdy 
generalized. This point will be returned to in the discussion. In this article, several well- 
known person-fit statistics are generalized to the Bayesian framework. Note that Reise 
(2000) used empirical Bayes estimation methods in a logistic regression framework to 
determine the fit of an item score pattern. 

This paper is organized as follows. First, we will introduce some relevant IKT models 
and some person-fit statistics that are often used. Second, we will discuss the principles 
of MCMC methods to sample the posterior distribution of a person-fit statistic. Third, we 
will conduct a simulation study in which we will investigate how many persons and how 
many items are needed in the sample to apply this method in practice. Finally, we will 
conduct a simulation study to determine the effectiveness of several person-fit statistics. 



IRT models and Person Fit 

In IRT (Rasch, 1960; Bimbaum, 1968; Mokken, 1971; Lord, 1980; Hambleton & 
Swaminathan, 1985; van der Linden & Hambleton, 1997) the probability of a correct 
response on item j (j = 1 , k), Pj{6), is a function of the latent trait value 6 and a 
number of item characteristics. Often used models are the one, two, and three parameter 
logistic (1, 2, and 3PL) models (Hambleton & Swaminathan, 1985). For example, in 
the 3PL model, the item is characterized by a difficulty parameter 3 ] , a discrimination 
parameter otj and a (pseudo- )guessing probability 7 ., which is the lower asymptote of 
Pj(6) when 6 ► — 00 . Most person-fit studies have been conducted in the context of the 
logistic IRT models (Meijer & Sijtsma, in press). In a Bayesian framework, however, the 
3PNO model (e.g., Lord, 1980, pp. 13-14) has some computational advantages, although 
the 3PNO model and the 3PL model are completely equivalent for all practical purposes. 
In the 3PNO model, the probability of correctly answering an item is given by 



PM= 7 ; + (1 - 



( 1 ) 




7 



Bayesian Person Fit Indices - 5 



where $ denotes the standard normal cumulative distribution. 

To investigate the goodness-of-fit of item score patterns, several IRT-based person-fit 
statistics have been proposed. Most person-fit statistics have the form 

k 

m = (2) 

3 = 1 

where Yj is the response to item j, where the weight Vj(6) is often defined as an increasing 
function of the likelihood of the observed item scores. So the test is based on the 
discrepancy between the observed scores Yj and the expected scores under the model, 
Pj (6). A straightforward example of a member of the class defined by (2) is the W- 
statistic by Wright and Stone (1979), which is defined as 

w= jQ ^- m !L (3) 

£,.i Pim - pm 

A related statistic was proposed by Smith (1985, 1986) where the set of test items is 
divided into S non-overlapping subtests denoted A a (s = 1, ..., S). Then the unweighted 
between-sets fit statistic UB is defined as 



•s - 1 



(4) 



Other obvious members of the class defined by (2) are two statistics proposed by Tatsuoka 
(1984): Ci and C 2 - The Ci-statistic is the standardization with a mean of 0 and unit variance 
of 



k 



( 5 ) 
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where rij denotes the number of correct answers to item j and rij denotes the mean 
number of correctly answered items in the test. The index will be positive indicating 
misfitting response behavior when easy items are incorrectly answered and difficult items 
are correctly answered, and it will also be positive if the number of correctly answered 
items deviates from the overall mean score of the respondents. If a response pattern is 
misfitting in both ways, the index will obtain a large positive value. The ^-statistic is a 
standardization of 

k 

<5 = £WM - n][P#) - R/k] (6) 

j=l 

where R is the person’s number-correct score on the test. This index is sensitive to item 
score patterns with correct answers to difficult items and incorrect answers to easy items; 
the overall response tendencies of the total sample of persons is not important here. 
Another well-known person-fit statistic is the log-likelihood statistic 

fc 

‘ = £« log pm + (i - v) io g (i - pm)}, m 

j=i 



first proposed by Levine and Rubin (1979). It was further developed in Drasgow, Levine, 
and Williams (1985), and Drasgow, Levine, and McLaughlin (1991). Drasgow et al. 
(1985) proposed a standardized version l z of l which was purported to be asymptotically 
standard normally distributed; l z is defined as 



/-£(/) 

[Far(0]2’ 



( 8 ) 



where E ( l ) and Var(l) denote the expectation and the variance of l, respectively. These 
quantities are given by 



E (i) = £ {Pi (6) log [Pi (9)] + [1 - P, (0)] log (1 - Pi (0)]} , 



( 9 ) 
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and 

Var (/) = £ P, W [1 - Pi Ml [log 2 ■ (10) 

J=1 

It can easily be shown that l — E(l ) can be written in the form of Equation (2) by choosing 

v ‘ w = log {jrm) • (11) 

The assessment of person fit is usually contaminated with the estimation of 0. If 
6 is an estimate of 6 then the distributions of a person-fit statistic using 6 instead of 9 
will differ. For example, Molenaar and Hoijtink (1990) showed that the distribution of 
l z differs substantially from the standard normal distribution for short tests. Snijders (in 
press) derived expressions for the first two moments of the distribution: E |V(0)J and 
Var |V(0)J and performed a simulation study for relatively small tests consisting of 8 
and 15 items and for large tests consisting of 50 and 100 items, fitting the 2PL model, and 
estimating 0 by maximum likelihood. The results showed that the approximation was 
satisfactory at Type I error levels of a = 0.05 and a = 0.10, but that the empirical Type I 
error was smaller than the nominal Type I error for smaller values of a. In fact, both the 
distribution of l z and the version of l z corrected for 6, denoted /*, are negatively skewed 
(Snijders, in press; van Krimpen-Stoop & Meijer, 1999). This skewness influences the 
difference between nominal and empirical Type I error rates for small Type I error values. 
For example, Snijders (in press; see also Krimpen-Stoop and Meijer, 1999) found that 
for a 50-items test at a = .05 the discrepancy between the nominal and the empirical 
Type I error for l z and Z* at 6 = 0 was small (.001), whereas for a = .001 for both 
statistics it was larger (approximately .005). Van Krimpen-Stoop and Meijer (1999) found 
that increasing the item discrimination resulted in a distribution that was more negatively 
skewed. An alternative may be to use a x 2 -distribution; statistical theory that incorporates 
the skewness of the distribution is not yet available, however, for the 2PL and 3PL models. 
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Examples of person-fit tests outside the class defined by (2) are the uniformly most 
powerful (UMP) tests by Klauer (1991, 1995; see also Levine &Drasgow, 1988). Klauer’s 
approach entails an UMP test for testing whether a person’s item score pattern complies 
with the Rasch (1960) model against a specific alternative model that can be viewed as 
a generalization of the Rasch model. The statistic that forms the basis of an UMP test is 
the sufficient statistic for the parameters that have to be added to the null model (in this 
case the Rasch model) to define the alternative model. 

For example, consider the test of the Rasch model against an alternative model where 
the ability parameter differs between subtest Ai and subtest A?. Let be the individual’s 
ability on subtest A\ and let 02 be the individual’s ability on subtest A 2 . Furthermore, 
consider the number-correct score on the first and second subtest, respectively and let S 
= 9 1 — 62 - Then Ho: 5=0 can be tested against Hi: 5^0 using the number-correct 
score on either one of the subtests. Note that in an IRT model it is assumed that for each 
person the latent trait is invariant across items, if this is not the case this may point at 
aberrant response behavior. In contrast to, for example, calculating the log-likelihood as 
given in (7) or (8) we now explicitly test against an alternative hypothesis. So when the 
null hypothesis is rejected for a particular person this person can be classified as aberrant. 
We will denote the statistical test where we test if the total score on the first subtest is 
too high compared to what we expect based on the model as Ti, and we will denote the 
statistical test where we test if the test score on the second subtest is too high compared 
to what we expect on the basis of the model as T 2 . 

As another example, Klauer (1991, 1995) proposed a person-fit test for violation 
of the assumption of local independence using an alternative model proposed by 
Kelderman (1984, also see, Jannarone, 1986) where the probability of a response pattern 
(' Vi,-,Vj,-,Vk ) is given by 



P(y u ...,y j ,...,y k \9) oc exp 



" fc fc -1 

^2 Vi (0 ~ Pj) + ViVi+i s • 



J= 1 



i=i 



( 12 ) 
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Note that 6 models the dependency between yj and jjj+i. If 6 = 0 the model equals the 
Rasch model. An UMP test denoted as Tlag of the null hypothesis 5 = 0 can be based on 
a the sufficient statistic with realizations VjVj+i • 

When the item discrimination parameters aj are considered known, the principle 
of the UMP test can also be applied to the 2PL model. Analogous UMP tests for the 
3PL model and the normal ogive model cannot be derived because these models have no 
sufficient statistic for 6. Even though UMP tests do not exist for these models, the notion 
of using statistics related to the parameters of an alternative model as a basis of a test is 
intuitively appealing. Therefore, the generalizations of these tests to the 3PNO model in 
a Bayesian framework will also be studied below. 

Bayesian estimation of the 3PNO model 

In this study, an MCMC procedure will be used to generate the posterior distributions 
of interest. The MCMC chains will be constructed using the Gibbs sampler (Gelfand & 
Smiths, 1990). To implement the Gibbs sampler, the parameter vector is divided into a 
number of components, and each successive component is sampled from its conditional 
distribution given sampled values for all other components. This sampling scheme is 
repeated until the sampled values form stable posterior distributions. 

Albert (1992; see also Baker, 1998) applies Gibbs sampling to estimate the 
parameters of the well known 2PNO model (e.g., Lord & Novick, 1968). Johnson and 
Albert (1999, Section 6.9) generalized the procedure to the 3PNO. For application of the 
Gibbs sampler, it is important to create a set of partial posterior distributions that are easy 
to sample from. This often involves the data augmentation, that is, the introduction of 
additional latent variables that lead to a simple set of posterior distributions. In the Gibbs 
sampling algorithm, these latent variables are sampled along with the variables of interest. 
The present procedure is based on two data augmentation steps. The first step entails the 
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introduction of binary variables such that 



f 1 if person i knows the correct answer to item j 
\ 0 if person i does not know the correct answer to item j. 



( 13 ) 



So if Wij = 0 , person i guessed the response to item j, if = 1, person i knows the right 
answer and gives a correct response. The relation between W \j and the observed response 
variable is given by a model where $(77^), with 77^ = o.j 9 i — 0 j, is the probability 
that the respondent knows the item and gives a correct response with probability one, and 
a probability (1 — $(77^)) that the respondent does not know the item and guesses with 
7 j as die probability of a correct response. So the probability of a correct response is a 
sum of a term $(77^) and a term 7^(1 — $(77^)). Summing up we have 

P(Wij = 1 1 Yij = 1,77 oc $(77^) 

P(W {j = 0 1 = hVijrtj) oc 7j(l - 

( 14 ) 

P(Wij = 1 1 = 0 , Vijrfj) = 0 

P(Wij = 0 \Yij = 0 , 77 ^, 7 ^) = 1 . 



The second data augmentation step is derived using a rationale which is analogous to a 
rationale often used as a justification of the 2 PNO (see, for instance, Lord, 1980 , Section 
3 . 2 ). In that rationale, it is assumed that, if person i is presented item j, a latent variable 
Z x j is drawn from a normal distribution with mean 77^ and a variance equal to one. A 
correct response yy = 1 is given when the drawn value is positive. Analogously, in the 
present case, a variable Z {j is introduced with a distribution defined by 



f /V (77^,1) truncated at the left by 0 if W x j = 1 

\ Nirjij,!) truncated at the right by 0 if Wij = 0. 



( 15 ) 



The item parameters a have a prior p(cx,( 3 ) = f]£=i I (a? > 0 ), which insures that the 
discrimination parameters are positive. Note that this prior is uninformative with respect 
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to (3. The guessing parameter 7 . has the conjugate prior Beta(a, b ) . The ability parameters 
0 have a standard normal distribution, that is, p = 0 and a = 1 . 

The procedure described below is based on the Gibbs sampler. The aim of the 
procedure is to simulate samples from the joint posterior distribution of at, ( 3 , 7 , 0 , z and 
w, given the data y, which are the responses of n test takers to k items. This posterior 
distribution is given by 



p(a,/ 3 , 7 , 0 ,z,w|y) = p(z,w|y ■,a,(3,-y,0,)p('y)p(a,f3)p(0) 



= CUYl P(% | w ij , Vij )p(wij I Vij , Vij >7 j )v{lj )I(afj > 0 ) 

t=l j=l 



<t>(0i, p = 0,a = 1) 



(16) 



where p(it% \y ijt p^^j ) is given by (14) and p{z xj \ w ij} r) tj ) follows from (15). 

Although the distribution given by (16) has an intractable form, as a result of the two 
data augmentation steps, the conditional distributions of a, /3, *y, 0 , z and w are now each 
tractable and easy to sample from. A draw from the full conditional distribution can be 
obtained in the following steps. 

Step 1 The posterior p(z, w |y ; a,/3, 7 , 0) is factored as p(z |y ; w, a, /3, 7 , 0) 
p(w |y ; a, (3, y, 0), and values of w and z are drawn in two substeps: 

• Draw from the distribution of conditional on the data y and a, f3, 7 , 0 , given 
by (14). 

• Draw Zij from the conditional distribution of Zij given all other variables using (15), 



Step 2 Draw from the conditional distribution of 0 given the values z, w, a:, /3, 7 , and 
y. Since p(0 |y ; z, w, a, (3, 7 , 0) is proportional to p ( z \0 a, f3)p(0)p(-y, w, y |z, a, (3), 
and the last term also does not depend on 0 , it follows from the definition of given 
above that the error term in Z xj - f3j = a j 6 l + is a normally distributed. So the 
full-conditional distribution of 0 entails a normal model for the regression of Z l3 — f 3 j on 
Oj,with 0i as a regression coefficient which has a normal prior with parameters p, = 0 
and ( 7 = 1 . (see, for instance, Gelman, et al., 1995, p.45 and p.78) 
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Step 3 Draw from the conditional distribution of the parameters of item j, aj t and /3j. 
Analogous to the previous step, also this step entails sampling from a regular normal 
linear model. Defining Z <j = (Z lj} Z nj ) T , and X = ( 0 , - 1 ), with -1 being the n 
dimensional column vector with elements - 1 , the two items parameters can be viewed as 
regression coefficients in Zj = X(aij,/3 )) T + e, where e is a vector of random errors. 
So also this step boils down to sampling the regression coefficients in a regular Bayesian 
linear regression problem. 

Step 4 Sample from the conditional distribution of ■y j . The likelihood of toy, ...,w n j is 
a binomial with parameter ■jj. With the noninformative conjugate Beta prior introduced 
above, the posterior distribution of 7 ^ also follows a beta distribution (see, for instance, 
Gelman, et al., 1995, Section 2.1). 

So the procedure boils down to iteratively generating a number of sequences of 
parameter values using these four steps. Convergence can be evaluated by comparing the 
between- and within-sequence variance (see, for instance, Gelman, et al., 1995). Starting 
points of the sequences can be provided by the Bayes modal estimates of BELOG-MG 
(Zimowski, Muraki, Mislevy, & Bock, 1996). For more information on this algorithm 
refer to Albert (1992), Baker (1998), and Johnson and Albert (1999). 

In the Bayesian approach, the posterior distribution of the parameters of the 3PNO 
model, say p(£\y), is simulated using a Markov chain Monte Carlo (MCMC) method 
proposed by Johnson and Albert (1999). Person fit will be evaluated using a posterior 
predictive check based on an index T(y,£). When the Markov chain has converged, 
draws from the posterior distribution can be used to generate model-conform data y Tep 
and to compute a so-called Bayes p- value defined by 

Pr(r(S/^,{)>r(l/,£)|l/). (17) 

So person-fit is evaluated by computing the relative proportion of replications, that is, 
draws of £ from p(£|y), where the person-fit index computed using the data, T(y,£), 
has a smaller value than the analogous index computed using data generated to conform 
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the IRT model, that is T(y rep , ^.Posterior predictive checks are performed by inserting 
the person-fit statistics given in the previous section into Equation (17). After the bum- 
in period, when the Markov Chain has converged, in every n-th iteration (n > 1 ), 
using the current draw of the item- and person parameters, a person-fit index T(y, £) 
is computed, a new model-conform response pattern is generated, and a value T(y rep , () 
is computed. Finally, a Bayesian p - value is computed as the proportion of iterations were 
T(y rep ,0>T(y,Q. 



Simulation studies 

This simulation study consists of two parts. In the first part we will investigate the 
Type I error rate as a function of test length and sample size. In the second part we will 
investigate the detection rates of the different statistics for different model violations, test 
lengths, and sample sizes. Furthermore, we will investigate the impact of nonfitting item 
scores on the bias in 9 as a function of the number of test items affected by lack of model 
fit. In all three simulation studies we will use the statistics l, W, UB,C, i, C2> Tlag. T\ and 
T 2 as defined above. 

Study 1: Type I Error Rate 

Method 

The simulation studies with respect to the Type I error rate were performed in two 
conditions: one with random and one with fixed item parameters. In both conditions, the 
ability parameters were drawn from a standard normal distribution. In the first condition, 
for every replication the item parameters were drawn from the default prior distributions 
used in BILOG-MG. The guessing parameter 7 was drawn from a Beta(a, b ) distribution 
with a and b equal to 5 and 17, respectively. This results in a mean 7 of 0.20. Further, 
the item discrimination parameters were drawn from a lognormal distribution with mean 
zero and a variance equal to 0.5 and the item difficulty parameters /3 were drawn from a 
normal distribution, also with mean zero and variance 0.50. In the second condition, the 
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item parameters were fixed. The 7 was fixed to 0.20 for all items. Item difficulty and 
discrimination parameters were chosen as follows: 

• for a test length k = 30, three values of the discrimination parameter, 0.5, 1 .0, and 1 .5, 
were crossed with ten item difficulties = —2.00 + 0.40(i — 1), i = 1, ..., 10. 

• for a test length k = 60, three values of discrimination parameters, 0.5, 1.0, and 1.5, 
were crossed with twenty item difficulties = —2.00 + 0.20(i — 1 ), i = 1, ...,20. 

Three samples sizes were used: n = 100, n = 400, and n = 1000. The true values 
of the parameters Were used as starting values for the MCMC procedure. The procedure 
had a run length of 4000 iterations with a bum-in period of 1000 iterations. That is, 
the first 1000 iterations were discarded. In the remaining 3000 iterations, T(y rep ,£ ) and 
T(y, £) were computed every 5 iterations. So the posterior predictive checks were based 
on 600 draws. For the statistics that uses a partitioning of the items into subtests, the items 
were ordered according to their item difficulty (3 and then two subtests of equal size were 
formed, one with the difficult and one with the easy items. Finally, for every condition, 
100 replications were simulated and the proportion of replications with a Bayesian p- value 
less than 0.05 was determined. 

Results 



Insert Thble 1 and 2 about here 



The results for the condition with random item parameters are shown in Table 1; the 
results for the condition with fixed item parameters are shown in Table 2. It can be seen 
that, in general, the significance probabilities convene to their nominal value of 0.05 as 
a function of sample size and test length, and the nominal significance probability is best 
approximated by the combination of a test length k = 60 and a sample size n = 400 or 
n = 1000. Note that for n = 100 the significance probabilities are much too large. There 
are no clear effects for specific person-fit statistics, except that the U B and the Ti ag seem 
to be quite conservative for n = 400 and n = 1000 and random item parameter selection. 
Finally, at the bottom of the two tables, the mean over replications and simulees of the 
absolute difference between the true and the estimated ability parameters is given. This 
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mean absolute error (MAE) will be used to interpret the bias in the ability parameters of 
simulees with nonfitting response vectors in the simulation study discussed below. 

Study 2: Detection rates 

Method 

Guessing. In several studies problems are discussed of unmotivated persons that 
take a test which they do not have personal interest in. For example, Schmitt, Cortina, 
and Whitney (1993) noted the potential for suspicion and disdain among employees in a 
concurrent validation study, the authors predicted that factors including poor motivation 
and cheating may lead to inaccurate assessment of abilities for some employees. In 
such testing conditions persons may guess the correct answers to groups of items, or 
they may produce typical item score patterns like repeated patterns of item responses. 
Identifying these examinees prior to item calibration, equating, and score reporting may 
help to improve the usefulness of results from a large scale testing program. It has been 
suggested that person-fit statistics may be useful to detect such behavior (Haladyna, 1994, 
P-165). 

To evaluate the detection rate of guessing, a number of simulation studies were 
carried out. These studies generally had the same set-up as the Type I error rate studies 
(Study 1) under the condition with fixed item parameters, with the following alterations. 
The condition with sample size of n = 100 was not used because of its inflated Type 
I error rate. The data were generated in such a way that guessing occurred for 10% of 
the simulees, so data matrices with n = 400 simulees had 40 aberrant simulees, and data 
matrices with n = 1000 simulees had 100 aberrant simulees. For these aberrant simulees, 
guessing was imposed in three conditions, where 1 /6, 1/3, or 1/2 of the test was corrupted 
by guessing. So for the test with k = 30 items, the number of corrupted items was either 
5, 10, or 15, and for the test with k = 60 items, the number of corrupted items was either 
10, 20, or 30. Guessing was always imposed on the items with the lowest item difficulty. 
This was done because guessing on the easiest items has the most detrimental effect on the 
estimation of 9 (Meijer & Nering, 1997) and thus detection of these item score patterns is 
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important. The probability of a correct response to these items by aberrant simulees was 

0 . 20 . 

Test statistics were computed in the same way as in Study 1. So, again, for statistics 
based on a partitioning of the test, two subtests of equal size were formed: a difficult 
and an easy one. As a result, the corrupted items were in the easiest test, although the 
partitioning did not completely conform to the pattern of corrupted and uncorrupted items. 
So in this sense, the partition was not optimal. However, in real situations, there is usually 
no prior knowledge of which items are corrupted, so the setup was considered realistic. 

A final remark concerns the computation of Ti and T 2 . The latter was computed as 
described above, that is, its Bayesian p-value indicates how often the observed score was 
lower than the score replicated under the model. So a low p-value for T 2 indicates that 
the score on the second subtest was too high. However, in the simulation study, the item 
parameters were ordered from difficult to easy, and guessing was imposed on the easy 
items. Therefore, it is expected that the score on the easiest subtest will be too low, so for 
Ti the orientation of the test is changed from right-tailed (too high scores) to left-tailed 
(too low scores). That is, Ti should detect too low scores. 50 replications were simulated 
in every condition. 

Item disclosure. In high-stakes testing, persons may be tempted to obtain knowledge 
about the type of test questions or even about the correct answers to the items in the test. 
In computerized adaptive testing this is one of the major threats to the validity of test 
scores. But also in standardized paper-and-pencil tests this is a realistic problem. For 
example, in personnel selection commercial available tests are often used by different 
companies. This makes the threat of item disclosure realistic due to repeated test taking. 
Item disclosure may result in a larger percentage of correct answers than expected on the 
basis of the trait that is being measured. 

Note, that in general it is unknown on which and on how many items a person 
has knowledge of the correct answers. Item preknowledge on a few items will only 
have a minor effect on the number-correct score (Meijer & Nering, 1997). Also, item 
preknowledge of the correct answers on the easiest items in the test will only slightly 
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improve the number-correct score. This suggests that in particular item preknowledge 
on the items of median difficulty and on the most difficult items may have an effect on 
the total score. Thus, the effect of item preknowledge will be important in particular for 
persons with a low ability level that answer many difficult items correctly. 

The setup of the simulation study to the detection rate of the tests for item disclosure 
was analogous to the study to the detection rate for guessing. So data were generated for 
sample sizes of n = 400 and n = 1000 simulees, and test lengths of k = 30 and k = 60 
items, item disclosure was prominent for 10% of the simulees, and for these simulees, 
1/6, 1/3 or 1/2 of the difficult items in the test were corrupted. The probability of a 
correct response to these items was chosen to be 0.80. Test statistics were computed in 
the same way as in the guessing study, except for T\ and T 2 , which are now right-tailed 
as all other statistics in the study That is, both statistics are designed to detect scores that 
are too high. Again, 50 replications were simulated in every condition. 

Violations of local independence. When previous items provide new insights useful 
for answering the next item or when the process of answering items is exhausting, the 
assumption of local independence may be violated. This may result, for example, due 
to speeded testing situations or in situations were there is exposure to material among 
students CVfen, 1993; see also Embretson & Reise, 2000, pp. 231-233). 

The setup of the simulation study to the detection rate of the tests for violation of 
local independence was analogous to the studies to the detection rate of guessing and 
item disclosure. Again, data were generated for sample sizes of n = 400 and n = 1000 
simulees, and test lengths of k = 30 and k = 60 items, the model violation was 
imposed on 10% of the simulees, and for these simulees, 1 /6, 1/3 or 1 /2 of the test was 
corrupted. Responses to corrupted items were generated with the model defined by (12), 
with 8 = 1.0. In these simulations, the items were ordered such that the affected items 
succeeded each other. For the condition were 1/3 of the test was corrupted, the model 
violation was imposed on the items with a = 1.0. For the condition were 1/6 of the test 
was corrupted, the model violation was imposed on the items with a = 1.0 and the lowest 
item difficulties. For the condition were 1/2 of the test was corrupted, the model violation 
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was imposed on the items with a = 1.0 and, in the case when there were too few items 
with a = 1, on the items with a = 0.5 and the lowest item difficulties. The impact of a 
violation with 6 = 1.0 was an average increase in lag, X^j=i ViUj+u of 1.6, 4.0 and 5.9 
for a test of 60 items with 1/6, 1/3 or 1/2 of the items corrupted, respectively, and of 0.7, 
1.5 and 2.4 for a test of 30 items with 1/6, 1/3 or 1/2 of the items corrupted, respectively. 

Test statistics were computed in the same way as in the study of item disclosure 
reported above, with the exception that for T\ the focus was on higher-than-expected 
outcomes. Again, 50 replications were made in every condition. 

Results 

Guessing. The proportions of “hits”, that is, the proportion of correctly identified 
aberrant simulees are shown in Table 3. The proportions of “false alarms”, that is, the 
proportion of normal simulees incorrectly identified as aberrant, are shown in Table 4. 



Insert Table 3 and 4 about here 



The optimal condition for the detection of guessing is a large sample size and a large 
test length. Therefore, the results of the condition with n = 1000 simulees and k = 60 
items will be discussed first. The main overall trend for all tests is that the detection 
rate decreases as the number of affected items increases. This can be explained by the 
inflated MAE of 9 for the misfitting simulees (bottom Table 1). It can be seen that the 
MAE for the misfitting simulees is grossly inflated, where the MAE is larger forp =1/2 
and p = 1/3 than for p = 1/6. Comparing these results with the results in Table 1, 
it can also be concluded that the presence of 10% misfitting simulees in the calibration 
sample affected the MAE for the fitting simulees to some degree. As the number of 
affected items increases, the MAE also increases, and since the fit statistics are computed 
conditionally on 8, the detection rate decreases. Inspection of the results in the condition 
with n = 400 simulees and k = 60 shows that the detection rate is little affected by the 
smaller calibration sample. Furthermore note that the detection rate of Ti is lower than 





Bayesian Person Fit Indices - 19 



the detection rate of T 2 . So the bias in 6 is such that the low scores on the first part of the 
test are less unexpected than the relatively high scores on the second part of the test. 

For a test length of k = 30 items, the detection rate is slightly less than for k = 60 
items. This is as expected, because the statistics are computed on an individual level and 
on this level the test length is the number of observations on which the test is based. Note 
that the relatively low detection rates of Tlag and Ti found for k = 60 also applies for 
k = 30. Finally, at the bottom of the table it can be seen that the MAE was less inflated 
than for the study with k = 60. The explanation is that the absolute numbers of affected 
items that was responded to is lower here. 

Item disclosure. The proportions of hits and false alarms are shown in Table 5 and 6, 
respectively. It can be concluded that the effects of test length and proportion of affected 
items are also found here. Furthermore, the absence of an effect of calibration sample 
size is replicated here. The detection rates of Ti ag are relatively low. 



Insert Table 5 and 6 about here 



Remember that now the items in the second part of the test, that is, the easy items, were 
affected by the model violations. Therefore, it was expected that T 2 would be sensitive 
to the increase in the total score on the second half of the test. Table 5 shows that this 
expectation was confirmed (e.g., detection rates between 0.24 and 0.50 for n = 1000 
and k = 30). However, note that for Ar = 30 the detection rate of Ti was also relatively 
large (between 0.27 and 0.30) with, contrary to T 2 , a high false alarm rate (between 0.27 
and 0.29). Thus, the bias of the ability estimate caused by the model violation was large 
enough to affect the simulation of the predictive distribution of T\. In practice this is 
undesirable, because one does not know a priori which part of the test is affected and the 
interpretation of the outcome of Ti and T 2 is problematic. 
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Violations of local independence. The proportions of hits and false alarms are shown 
in Table 7 and 8, respectively. 



Insert Table 7 and 8 about here 



Although for the aberrant simulees the increase in lag reported above was 
considerable, in the two bottom lines of Table 7 it can be seen that the resulting bias in their 
ability estimates was not impressive. Also the detection rate of the tests was negligible. 
Even the power of the 2} ofl -test, which is specially targeted at this model violation was 
negligible. The only exception was the Ti-test. The reason is that the affected items were 
placed at the beginning of the test, and the increase in lag also resulted in an increase in 
the total score on the first part of the test. Note, however, that in Table 8 it can be seen 
that the false alarm rate of this test also increased. 

Discussion 

Aberrant response behavior in psychological and educational testing may result in 
inadequate measurement of some persons. Therefore, misfitting item scores should be 
detected and removed from the sample. To classify an item score pattern as nonfitting, 
the researcher can simply take the top 1 or top 5 percent of aberrant cases or he/she use 
a theoretical sampling distribution or can simulate datasets based on the estimated item 
parameters in the sample. In the first case person-fit statistics are used as descriptive 
statistics. In this study, we followed the approach in which we used person-fit statistics to 
test the hypothesis that an item score pattern is not in agreement with the underlying test 
model. Simulation methods thus far applied in the literature did not take into account the 
uncertainty of parameters of the IRT model. In this study, we used Bayesian methods that 
take into account this uncertainty to classify an item score pattern as fitting or nonfitting. 
Although Bayesian methods are statistically superior to other simulation methods, a 
drawback is that they are relatively complex and computational intensive. 
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Depending on the type of data and the problems envisaged, a researcher may choose a 
particular person-fit statistic, although not all statistics have equally favorable properties 
in a statistical sense. In general, sound statistical methods have been derived for the 
Rasch model, but because this model is rather restrictive to empirical data, the use of 
these statistics is also restricted. For the 2PL model and the 3PL model and for short tests 
and tests of moderate length (say, 10-60 items) due to the use of 9 rather than 6 , for most 
statistics the nominal Type I error rate under the standard normal distribution is not in 
agreement with the empirical Type I error rate (van Krimpen-Stoop & Meijer, 1998). As 
an alternative one may use the correction proposed by Snijders (in press) or one may use 
Bayesian simulation procedures discussed in this paper. 

From the results it can be concluded that even for a test as short as 30 items and for 
400 simulees the type I error is well under control (approximately 0.03 at an nominal for 
most statistics studied). In particular it is interesting to compare these results with the 
results obtained using the theoretical distribution. For example, Snijders (in press) found 
using the 2PL model and the log-likelihood statistic corrected for 9 in a simulation study 
with 100,000 replications for nominal type I error rates a = .05 (resulting in standard 
errors between 0.001 and 0.005) for a 15-items test empirical type I errors between 0.053 
and 0.061. Note, however, that he considered the item parameters known. This is only 
realistic when the item parameters can be estimated very accurately, that is, for very large 
sample sizes. For small sample sizes the method proposed in this study may be more 
suitable. 

Detection rates differed for different statistics and different types of model violations 
simulated. In general, it can be concluded that the detection rates for guessing and item 
disclosure were higher than for violations against local independence. Note, however, 
that also the MAE was relatively small in the latter case, in contrast to the MAE for 
the guessing condition. Also for item disclosure, the MAE was often slightly larger for 
misfitting score patterns compared to the MAE for fitting score patterns, although the 
power of some person-fit statistics was high (Table 5). Aggregated over all conditions, 
the C 2 ' test had the highest power. The expectation that the UMP tests for person fit in 
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the Rasch model (7\, T 2 , and Ti ag ) may also be superior in the framework of a 3PL 
model in an Bayesian framework was not corroborated. Traditional discrepancy tests 
do better here. Not reported above, but also included in the study were versions of 7\, 
T 2 , and Ti a g where the item scores were weighted by the discrimination parameters. The 
detection rates of these tests were consistently lower than those of the tests based on the 
unweighted scores. Interesting was that the detection rates decreased when the number 
of items affected by guessing increased. This is contrary to findings in earlier studies 
(Meijer & Sijtsma, 2001). This may be explained as follows. In the present study a larger 
amount of guessing resulted in a lower 9 than the original 9. As a result of using this lower 
9, item score patterns are less aberrant than using the original 9. In other studies, the 9 is 
often fixed, and as a result item score patterns are more often classified as misfitting. 

A final remark concerns the generalization of the procedure presented here to a 
general ERT framework incorporating models with multiple raters, testlet structures, latent 
classes, and multi-level structures (references given above). The common theme in these 
models is their complex dependency structure and the fact that these complex models 
can be estimated using the Gibbs sampler. In all cases, the structure of the estimation 
procedure is analogous: draws from the posterior distribution are made by partitioning the 
complete parameter vector into a number of components and sampling each component 
conditionally on the draws for the other component. Usually, the partition of the complete 
parameter vector is in the item parameters, the person parameters, augmented data (such 
as Z and W above) and hyperparameters which may be related to some restrictions on 
the parameters (as in testlet and other multilevel IRT models) and some of the priors. In 
all these models, the statistics described above can be computed given the current draw 
of the item and person parameters, both for the observed data y and replicated data y re P 
drawn from the posterior predictive distribution. 
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Actual Type I Error Rates for a Nominal a = .05 Test 
Random Item Parameters 
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Table 2 

Actual Type I Error Rates for a Nominal a = .05 Test 
Fixed Item Parameters 
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